Self-Tuning Database Systems: A Decade of Progress

Authors

  • Surajit Chaudhuri
  • Vivek R. Narasayya
Abstract

In this paper we discuss advances in self-tuning database systems over the past decade, based on our experience in the AutoAdmin project at Microsoft Research. The paper focuses primarily on the problem of automated physical database design. We also highlight other areas where research on self-tuning database technology has made significant progress, and we conclude with our thoughts on opportunities and open issues.

1. HISTORY OF AUTOADMIN PROJECT

Our VLDB 1997 paper [26] reported the first technical results from the AutoAdmin project, which was started at Microsoft Research in the summer of 1996. The SQL Server product group at that time had taken on the ambitious task of redesigning the SQL Server code for their next release (SQL Server 7.0). Ease of use and elimination of knobs was a driving force in the design of SQL Server 7.0. At the same time, in the database research world, data analysis and mining techniques had become popular. In starting the AutoAdmin project, we hoped to leverage some of these data analysis and mining techniques to automate difficult tuning and administrative tasks for database systems.

As our first goal in AutoAdmin, we decided to focus on physical database design. This was by no means a new problem, but it was still an open one, and it clearly impacted performance tuning. The decision to focus on physical database design was somewhat ad hoc; its close relationship to query processing, our area of past work, was an implicit driving function. Thus, the paper in VLDB 1997 [26] described our first solution to automating physical database design.

In this paper, we look back on the last decade and review some of the work on self-tuning database systems. A complete survey of the field is beyond the scope of this paper, and our discussion is influenced by our experiences with the specific problems we addressed in the AutoAdmin project. Since our VLDB 1997 paper was on physical database design, a large part of this paper is devoted to the progress in that specific sub-topic (Sections 2-6). In Section 7, we briefly discuss a few of the other important areas where self-tuning database technology has made advances over the last decade. We reflect on future directions in Section 8 and conclude in Section 9.

2. AN INTRODUCTION TO PHYSICAL DATABASE DESIGN

2.1 Importance of Physical Design

A crucial property of a relational DBMS is that it provides physical data independence. This allows physical structures such as indexes to change seamlessly without affecting the output of a query; but such changes do impact efficiency. Thus, together with the capabilities of the execution engine and the optimizer, the physical database design determines how efficiently a query is executed on a DBMS. The first generation of relational execution engines was relatively simple and targeted at OLTP, making index selection less of a problem. The importance of physical design was amplified as query optimizers became more sophisticated to cope with complex decision-support queries. Since query execution and optimization techniques were far more advanced, DBAs could no longer rely on a simplistic model of the engine, yet the choice of the right index structures was crucial for efficient query execution over large databases.

2.2 State of the Art in 1997

The role of the workload, including queries and updates, in physical design was widely recognized. Therefore, at a high level, the problem of physical database design was: given a workload, find a configuration, i.e., a set of indexes, that minimizes the cost of the workload.
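To pin down this problem statement, the sketch below states the search problem in Python using our own illustrative modeling (a workload as (query, weight) pairs, an opaque per-query cost callable standing in for whichever cost model is chosen, and a storage bound, which we add here only to make the constraint concrete). The exhaustive enumeration defines the objective; it is not a practical algorithm.

```python
from itertools import combinations

def workload_cost(workload, config, cost):
    """Total estimated cost of a workload (a list of (query, weight) pairs)
    under a configuration, for some per-query cost model `cost`."""
    return sum(weight * cost(query, config) for query, weight in workload)

def optimal_configuration(workload, candidate_indexes, cost, index_size, storage_bound):
    """Exhaustive statement of the physical design problem: among all subsets
    of candidate indexes that fit within the storage bound, return the one
    with minimum total workload cost. Exponential in the number of candidates,
    so shown only to define the objective that real tools approach heuristically."""
    best, best_cost = frozenset(), workload_cost(workload, frozenset(), cost)
    for k in range(1, len(candidate_indexes) + 1):
        for subset in combinations(candidate_indexes, k):
            config = frozenset(subset)
            if sum(index_size(ix) for ix in config) > storage_bound:
                continue  # configuration violates the storage constraint
            c = workload_cost(workload, config, cost)
            if c < best_cost:
                best, best_cost = config, c
    return best, best_cost
```

The interesting questions, as the historical survey below shows, are what to plug in for the cost model and how to search this space without enumerating it.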
However, early approaches did not always agree on what constitutes a workload, or on what should be measured as the cost of a given query under a given configuration. Papers on the physical design of databases started appearing as early as 1974. Early work such as Stonebraker's [63] assumed a parametric model of the workload, and work by Hammer and Chan [44] used a predictive model to derive the parameters. Later papers increasingly used an explicit workload [40],[41],[56], which can be collected using the tracing capabilities of the DBMS. Moreover, some papers restricted the class of workloads, whether explicit or parametric, to single-table queries. Sometimes such restrictions were necessary for the proposed index selection techniques to apply at all, and in some cases the goodness of a solution could be justified only for the restricted class of queries.

All papers recognized that it is not feasible to estimate the goodness of a physical design for a workload by actually creating the indexes and then executing the queries and updates in the workload. Nonetheless, there was considerable variance in the choice of cost model. Some papers compared the alternatives by building their own cost model: for columns on which no indexes were present, they built histograms, and their custom cost model used those histograms to compute the selectivity of the predicates in the queries.

Another set of papers, starting with [40], used the query optimizer's cost model instead of building a new, external cost model. The goodness of a configuration for a query was thus measured by the optimizer-estimated cost of the query under that configuration. In this approach, although histograms still needed to be built on columns for which no indexes existed, no new cost model was necessary. The approach did, however, require metadata changes to signal to the query optimizer the presence of (fake) indexes on those columns. A concern with this approach is the potential performance impact on the server, and therefore there was a need to minimize the number of optimizer calls [40],[41]. Some of the techniques for reducing optimizer calls introduced approximations, and thus lost full fidelity with the optimizer's cost model.
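The following minimal sketch, using our own naming rather than any system's API, shows the shape of this optimizer-in-the-loop costing: the optimizer-estimated cost is an opaque callable, and a cache keeps a search algorithm from repeating optimizer calls for (query, configuration) pairs it has already seen.

```python
class CachingCostEvaluator:
    """Illustrative wrapper (our naming): measures the goodness of a
    configuration by the optimizer-estimated cost of each query under it,
    caching results so that a search that revisits the same (query,
    configuration) pair does not pay for another optimizer call. Real
    systems go further and approximate some costs without any optimizer
    call, at the price of fidelity as noted above."""

    def __init__(self, optimizer_estimated_cost):
        # optimizer_estimated_cost(query, config) -> float. In a real system
        # this is a server round trip, which is why calls must be minimized.
        self._estimate = optimizer_estimated_cost
        self._cache = {}
        self.optimizer_calls = 0

    def cost(self, query, config):
        key = (query, frozenset(config))
        if key not in self._cache:
            self.optimizer_calls += 1
            self._cache[key] = self._estimate(query, config)
        return self._cache[key]

    def workload_cost(self, workload, config):
        # Workload cost is the weighted sum of per-query estimated costs.
        return sum(w * self.cost(q, config) for q, w in workload)
```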
The hardness of selecting an optimal index configuration was shown by Shapiro [60]. The challenge, therefore, was similar to that in query optimization: identifying the right set of heuristics to guide the selection of a physical design. One set of papers advocated an approach based on rule-based expert systems. The rules took into account query structure as well as statistical profiles, and the resulting tools were "stand-alone" applications that recommended indexes. A tool such as DEC RdbExpert falls into this category.

Rozen and Shasha [56] also used an external cost model, but theirs was similar to that of a query optimizer. They suggested a search paradigm that derived the best features for each individual query (using heuristics, without optimizer calls) and restricted the search to the union of those features. The idea of using the best candidates of individual queries as the search space is valuable, as we will discuss later.

The "stand-alone" approaches described above suffered from a key architectural drawback, as pointed out by [40], the first paper to propose an explicit workload model and also to use the query optimizer for estimating costs. That paper argued that the optimizer's cost model must be the basis for evaluating the impact of a physical design on a query. It also proposed building database statistics for non-existent indexes and making changes to the system catalog so that the optimizer can estimate costs for potential physical design configurations. Despite these key architectural contributions, the approach had several important limitations, which we discuss shortly.

3. REVIEW OF VLDB 1997 PAPER

3.1 Challenges

The AutoAdmin project started considering the physical design problem almost a decade after [40]. During that decade, tremendous progress was made on the query processing framework. The defining application of this era was decision-support queries over large databases. The execution engine supported new logical as well as physical operators, and engines used indexes in much more sophisticated ways; for example, multiple indexes per table could be used to process selection queries using index intersection (and union). Indexes were also used to avoid accessing the base table altogether, effectively supporting sequential scans of vertical slices of tables. Such indexes are known as "covering indexes" for a query: when a covering index for a query is present, the query can avoid accessing the data file. Indexes were further used to eliminate sorts that would otherwise have been required for a GROUP BY query. Optimization technology was able to handle complex queries that could leverage these advances in the execution engine. Finally, the workload that represented usage of the system often consisted of many queries and stored procedures coming from a variety of applications, and was thus no longer limited to a handful of queries.

While this new era dramatically increased the importance of the physical database design problem, it also exposed the severe limitations of past techniques. The "expert system" approach was no longer viable, as building an accurate external model of index usage was no longer feasible. Therefore, the approach taken in [40], using the optimizer's cost model and statistics, was the natural choice. Even there, however, we faced several key gaps in what [40] offered. First, the ingredients necessary to support the needed API functionality in a client-server architecture were not discussed. Specifically, given that databases for decision-support systems were very large and had many columns, creating statistics using traditional full-scan techniques was out of the question. Second, the new execution engines offered many more opportunities for sophisticated index usage, so elimination heuristics for reducing the search space of potential indexes (e.g., at most one index per table) were no longer adequate. Third, it was imperative that multi-column indexes be considered extensively, as they are very important for providing "covering indexes". The search strategy of [40] did not consider multi-column indexes, as they were of relatively low importance for the execution engines and application needs of a decade earlier. Finally, scalability of the tool with respect to workload size was also quite important, as traces, whether generated or provided by DBAs, could consist of many (often hundreds of thousands of) queries and updates, each of which can be quite complex.
3.2 Key Contributions

The first contribution of our AutoAdmin physical design project was the creation of a new API that enabled a scalable way to create a hypothetical index. This was the most important server-side enhancement necessary. A detailed description of this interface, referred to as a "what-if" (or hypothetical) index, appeared in [27] (see Figure 1). The key aspects are: (1) a "Create Hypothetical Index" command that creates a metadata entry in the system catalog defining the index; and (2) an extension to the "Create Statistics" command to efficiently generate, via sampling [25],[20], the statistics that describe the distribution of values of the column(s) of a what-if index.

Figure 1. "What-if" analysis architecture for physical database design. (The physical database design tool issues "create hypothetical physical design", "create statistics", "define configuration", and "optimize query" requests to the database server, which returns the query execution plan.)

A related requirement was an optimization mode that enabled optimizing a query for a selected subset of indexes (hypothetical or actually materialized) while ignoring the presence of other access paths. This too was important, as the alternative would have been repeated creation and dropping of what-if indexes, a potentially costly solution that was used by [40]. This mode is achieved via a "Define Configuration" command followed by an invocation of the query optimizer. The importance of this interface went far beyond automated physical design. Once exposed, it also made manual physical design tuning much more efficient: a DBA who wanted to analyze the impact of a physical design change could do so without disrupting normal database operations.

The next contribution was the key decision to define the search space as consisting of the best configuration(s) for each query in the workload, where the best configuration is the one with the lowest optimizer-estimated cost for that query. Intuitively, this leverages the idea (see also [41]) that an index that is not part of an optimal (or close to optimal) configuration for at least one query is unlikely to be optimal for the entire workload. Unlike [41], the selection of a set of candidate indexes on a per-query basis is done in AutoAdmin in a cost-based manner, keeping the optimizer in the loop. This candidate selection step is key to scalable search.

Our VLDB 1997 paper also presented a set of optimizations for obtaining the optimizer-estimated cost of a query Q for a given configuration C, denoted Cost(Q,C), without having to invoke the query optimizer. The essential idea was to show how Cost(Q,C) could be derived from the costs of certain important subsets of C. Given the richness of the query processing capabilities of DBMS engines, a key challenge, addressed in [26], was defining which configurations should be considered atomic for a given query. By caching the results of the optimizer calls for atomic configurations, optimizer invocations for many other configurations for that query could be eliminated (often by an order of magnitude).

Finally, we proposed a search paradigm able to scale with the large space of multi-column indexes. The idea was to iteratively expand the space of multi-column indexes considered: pick only the winning indexes of one iteration, and augment the space for the next iteration by extending the winners with one additional column. Intuitively, this exploits the observation that a two-column index, say on (A,B), is unlikely to be beneficial for a workload unless the single-column index on its leading column (i.e., the index on (A)) is beneficial.
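The sketch below combines the candidate-selection and iterative-expansion ideas just described. The function names and the select_winners callable are our own illustrative stand-ins, not the paper's code; a real implementation would rank candidates by optimizer-estimated cost, e.g., via a caching evaluator like the one sketched earlier. Indexes are modeled simply as tuples of column names.

```python
def expand_candidates(workload, seed_indexes, columns_of, select_winners, max_width=3):
    """Sketch (our naming) of the iterative multi-column search: each round
    extends only the previous round's winning indexes by one extra column,
    then re-runs cost-based candidate selection to pick the next winners."""
    winners = set(seed_indexes)   # an index is modeled as a tuple of columns
    frontier = set(seed_indexes)  # winners of the latest round
    for _ in range(max_width - 1):
        # Extend each winner with one more column from its own table;
        # columns_of(index) returns the columns of the table the index is on.
        extended = {idx + (c,) for idx in frontier
                    for c in columns_of(idx) if c not in idx}
        if not extended:
            break
        # select_winners(workload, candidates) stands in for per-query,
        # cost-based candidate selection with the optimizer in the loop;
        # it returns the set of candidates that belong to a best (lowest
        # estimated cost) configuration for at least one query.
        frontier = select_winners(workload, extended) - winners
        if not frontier:
            break  # no new multi-column index earned its way in
        winners |= frontier
    return winners
```

The pruning is aggressive by design: an index on (A,B) is considered only if the index on (A) survived the previous round, which keeps the candidate space tractable at the risk of missing some wide indexes.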
The above key ideas formed the basis of the Index Tuning Wizard that shipped in Microsoft SQL Server 7.0 [29], the first tool of its kind in a commercial DBMS. In subsequent years, we were able to refine and improve our initial design.

4. INCORPORATING OTHER PHYSICAL DESIGN STRUCTURES

The VLDB 1997 paper [26] focused on how to recommend indexes for a given workload. Today's RDBMSs, however, support other physical design structures that are crucial for workload performance. Materialized views are one such structure: they are widely supported and can be very effective for decision-support workloads. Horizontal and vertical partitioning are attractive since they provide the ability to speed up queries with little or no additional storage and update overhead. The large additional search space introduced by these physical design structures requires new methods to deal with the resulting scalability challenges. In this section we describe the significant extensions to the search architecture of [26] needed to incorporate materialized views and partitioning (horizontal and vertical). We begin with a brief review of materialized views and partitioning and the new challenges they introduce.

4.1 Physical Design Structures

4.1.1 Materialized Views

A materialized view (MV) is a more complex physical design structure than an index, since a materialized view may be defined over multiple tables and can involve selection, projection, join, and group by. This richness of structure makes the problem of selecting materialized views significantly more complex than that of index selection. First, for a given query (and hence workload), the space of materialized views that must be considered is much larger than the space of indexes: MVs on any subset of the tables referenced in the query may be relevant, and for each such subset, many MVs with different selection conditions and group-by columns may need to be considered. Furthermore, a materialized view can itself have clustered and non-clustered indexes defined on it. Finally, if there are storage and update constraints, it is important to consider materialized views that can serve multiple queries. For example, if there are two candidate multi-table MVs, one with the selection condition Age BETWEEN 25 AND 35 and another with the selection condition Age BETWEEN 30 AND 40, then a single MV with the selection condition Age BETWEEN 25 AND 40 can replace both, with potentially reduced storage and update costs. Techniques for searching the space of MVs in a scalable manner are therefore of paramount importance.
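As a small illustration of the merging idea in the example above, the sketch below uses our own simplified modeling (table names are hypothetical, and the two candidates are assumed identical except for a range predicate on the same column) to replace two candidate views by one over the covering range.

```python
from typing import NamedTuple, Optional

class RangeMV(NamedTuple):
    tables: frozenset    # tables the view is defined over
    group_by: tuple      # group-by columns (empty tuple if none)
    column: str          # column carrying the range predicate
    lo: int              # lower bound of BETWEEN lo AND hi
    hi: int              # upper bound

def try_merge(a: RangeMV, b: RangeMV) -> Optional[RangeMV]:
    """Merge two candidate MVs that are identical except for their range
    predicate on the same column. The merged view spans the covering range,
    so it can answer any query served by either input (possibly scanning a
    few extra rows), while costing less to store and maintain than both."""
    if (a.tables, a.group_by, a.column) != (b.tables, b.group_by, b.column):
        return None  # not merge-compatible under this simplified model
    return a._replace(lo=min(a.lo, b.lo), hi=max(a.hi, b.hi))

# The example from the text: BETWEEN 25 AND 35 merged with BETWEEN 30 AND 40.
v1 = RangeMV(frozenset({"Person", "Sales"}), (), "Age", 25, 35)
v2 = RangeMV(frozenset({"Person", "Sales"}), (), "Age", 30, 40)
assert try_merge(v1, v2) == RangeMV(frozenset({"Person", "Sales"}), (), "Age", 25, 40)
```

A cost-based tool would accept such a merge only when the savings in storage and update cost outweigh the extra rows the wider view forces qualifying queries to scan.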
